Peptide Grouping Performance Benchmark using SWATH-MS Interlab Data¶
- License: Creative Commons Attribution-NonCommercial 4.0 International License
- Version: 0.1
- Edit Log:
- 2025-10-07: Initial version of the notebook
Requirements:
The following preprocessing and method runs must be completed (in order) before executing this notebook:
- `01-DataProcessing.R` — Prepares SWATH‑MS Interlab data and writes processed datasets to `data/prepared/` as `.feather` files
- `02-runCOPF.R` — Runs COPF; saves results to `data/results/` as `COPF_{dataID}_result.feather`
- `04-runProteoForge.py` — Runs ProteoForge; saves results to `data/results/` as `ProteoForge_{dataID}_result.feather`
Expected files: data/results/COPF_*, data/results/ProteoForge_* (feather format)
Note: Feather (`.feather`) is a fast, cross-language binary table format readable from both R and Python—ideal for exchanging processed data, but not human-readable.
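As a quick illustration, the Python side of such an exchange is a one-liner in each direction (a minimal sketch; the file name is made up, and the `pyarrow` package must be installed):

import pandas as pd

# Write a small table to Feather (R reads the same file with arrow::read_feather())
df = pd.DataFrame({"protein_id": ["P1", "P2"], "intensity": [1.5, 2.3]})
df.to_feather("example_table.feather")

# Read it back
df_back = pd.read_feather("example_table.feather")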
Data Information:
This analysis uses three perturbation datasets created by 01-DataProcessing.R. Each dataset corresponds to a perturbation scenario with a short identifier (used in filenames) and a display name (used in figures):
| Identifier | Display Name | Description |
|---|---|---|
| `2pep` | 2 Peptides | Perturb exactly 2 peptides per selected protein |
| `050pep` | %50 Peptides | Perturb ~50% of peptides per selected protein |
| `random` | Random (2 to %50) Peptides | Perturb 2–50% of peptides per protein (randomly) |
Note: The `1pep` scenario was excluded since a single perturbed peptide should not group with unperturbed ones, making it trivial.
How perturbations are generated:
- `01-DataProcessing.R` reads the SWATH‑MS interlab table, filters proteins/peptides, aggregates charge states, median-normalizes intensities, and selects proteins with ≥4 peptides and sufficient replicates
- A set of proteins to perturb is sampled, a per-protein reduction factor (`red_fac`) is drawn uniformly, and `generatePerturbedProfiles()` marks peptides as `perturbed_peptide` and reduces their intensities in `day5` (see the Python sketch below for the general idea)
- Prepared feather files are written to `./data/prepared/` and used by the method scripts (02–04) to produce result files
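The actual implementation lives in `01-DataProcessing.R`; the Python sketch below only illustrates the general idea. All names except `protein_id`, `day5`, `red_fac`, `perturbed_protein`, and `perturbed_peptide` are hypothetical, and the real sampling logic may differ:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def perturb_sketch(df, frac_proteins=0.5, n_perturb=2, red_range=(0.2, 0.8)):
    # Illustrative analogue of generatePerturbedProfiles(): sample proteins,
    # draw a per-protein reduction factor (red_fac) uniformly, and scale down
    # the day5 intensities of the chosen peptides.
    # n_perturb=2 mimics the "2pep" scenario; 050pep would use ~50% of peptides.
    df = df.copy()
    proteins = df["protein_id"].unique()
    chosen = rng.choice(proteins, size=int(len(proteins) * frac_proteins), replace=False)
    df["perturbed_protein"] = df["protein_id"].isin(chosen)
    df["perturbed_peptide"] = False
    for prot in chosen:
        pep_idx = df.index[df["protein_id"] == prot]
        picked = rng.choice(pep_idx, size=min(n_perturb, len(pep_idx)), replace=False)
        red_fac = rng.uniform(*red_range)  # per-protein reduction factor
        df.loc[picked, "perturbed_peptide"] = True
        df.loc[picked, "day5"] *= red_fac
    return df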
Purpose:
This notebook compares two methods (COPF, ProteoForge) across three simulated perturbation scenarios and generates publication-quality performance figures (ROC, PR, F1, MCC). Note that PeCorA is not included because it does not output peptide groups—only flags individual discordant peptides—making direct comparison difficult.
Setup¶
This section imports required libraries, configures display settings, and defines paths for data and figures.
Note: The HTML rendering of this notebook hides code cells by default. Click the "Code" buttons on the right to expand them.
Libraries¶
import os
import sys
import warnings
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import seaborn as sns # R-like high-level plots
import matplotlib.pyplot as plt # Python's base plotting
sys.path.append('../')
from src import utils, plots
warnings.filterwarnings('ignore')
# Initialize the timer
startTime = utils.getTime()
Display Settings¶
The cell below configures pandas, matplotlib, and seaborn display options for improved readability of tables and figures, including color palettes and figure export settings.
## Figure Settings
# Define default colors and styles for plots
def_colors = [
    "#139593", "#fca311", "#e54f2a",
    "#c3c3c3", "#555555",
    "#690000", "#5f4a00", "#004549"
]
# Set seaborn style
sns.set_theme(
    style="white",
    context="paper",
    palette=def_colors,
    font_scale=1,
    rc={
        "figure.figsize": (6, 4),
        "font.family": "sans-serif",
        "font.sans-serif": ["Arial", "Ubuntu Mono"],
    }
)
# Figure Saving Settings
figure_formats = ["pdf"]
save_to_folder = True
transparent_bg = True
figure_dpi = 300
## Configure dataframe displaying
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500) # Set a wider display width
## Printing Settings
verbose = True
## Show the color-palettes
plots.color_palette( def_colors, save=False )
Data and Result Paths¶
Data and figures are organized in separate folders:
- `data_path` — Input directory containing method results (COPF, ProteoForge)
- `output_path` — Output directory for aggregated performance data
- `figure_path` — Directory for generated plots and figures
notebook_name = "02-GroupingBenchmark"
data_path = "./data/results/"
output_path = f"./data/results/"
figure_path = f"./figures/{notebook_name}/"
# Create the output folder if it does not exist
if not os.path.exists(output_path):
    os.makedirs(output_path)

# Create figure folder structure, if needed
if save_to_folder:
    for i in figure_formats:
        cur_folder = figure_path + i + "/"
        if not os.path.exists(cur_folder):
            os.makedirs(cur_folder)
Analysis Parameters¶
Key parameters for the evaluation framework:
- `seed = 42` — Random seed for reproducibility
- `pthr = 1e-3` — P-value threshold highlighted on ROC and PR curves (conventional cutoff)
- `thresholds` — Range of p-value thresholds from `1` down to `1e-15` for calculating performance metrics (a hypothetical stand-in is sketched below)
- `update_proteoform_grouping_in_COPF` — Custom function to build proteoform groups from COPF results per threshold (COPF returns proteoform groups for a single threshold only)
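`utils.generate_thresholds` is part of this project's `src` package and its exact signature is not documented here. A hypothetical stand-in, consistent with the cutoffs that appear in the results (log-spaced 1e-15 to 0.1, plus a linear 0.0–1.0 grid in 0.1 steps, 25 values in total), might look like:

import numpy as np

def generate_thresholds_sketch(base=10.0, min_exp=-15, max_exp=-1,
                               lin_start=0.0, lin_stop=1.0, lin_step=0.1):
    # Hypothetical stand-in for utils.generate_thresholds (parameter names
    # are guesses, not the real signature): log-spaced cutoffs
    # (1e-15 ... 0.1) plus a linear grid (0.0, 0.1, ... 1.0)
    log_part = base ** np.arange(min_exp, max_exp + 1, dtype=float)
    lin_part = np.arange(lin_start, lin_stop + lin_step / 2, lin_step)
    return np.unique(np.concatenate([log_part, lin_part]))  # 25 unique values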
# Global variables
seed = 42 # Seed for reproducibility
pthr = 10**-3 # p-value threshold for significance
thresholds = list(utils.generate_thresholds(10.0, -15, 1, 0, 1, 0.1)) # Thresholds for the analysis
data_order = ["2 Peptides", "Random (2 to %50) Peptides", "%50 Peptides"]
method_palette = {
    "COPF": "#139593",
    "ProteoForge": "#e54f2a",
}
method_styles = {
    "COPF": "--",
    "ProteoForge": ":",
}
method_markers = {
    "COPF": "o",
    "ProteoForge": "^",
}
def update_proteoform_grouping_in_COPF(
    data: pd.DataFrame,
    score_thr: float = 0.5,  # COPF-specific score cutoff; if None, only pval_thr is applied
    pval_thr: float = 0.05,
    protein_col: str = "protein_id",
    cluster_col: str = "cluster",
    pval_col: str = "proteoform_score_pval_adj",
    score_col: str = "proteoform_score",
    sep: str = "_"
):
    """Assign per-threshold proteoform groups to COPF output (COPF itself reports groups for a single threshold only)."""
    # Create initial proteoform_id based on score and pval thresholds
    if score_thr is not None:
        condition = (data[score_col] >= score_thr) & (data[pval_col] <= pval_thr)
    else:
        condition = data[pval_col] <= pval_thr
    data['proteoform_id'] = data[protein_col]
    data.loc[condition, 'proteoform_id'] = data.loc[condition].apply(
        lambda row: f"{row[protein_col]}{sep}{int(row[cluster_col])}", axis=1
    )
    # If proteoform_id is exactly the same as protein_id, mark it as cluster 0 ("_0")
    data.loc[data[protein_col] == data['proteoform_id'], 'proteoform_id'] = data[protein_col] + f"{sep}0"
    # Special case for cluster 100
    data.loc[data[cluster_col] == 100, 'proteoform_id'] = data[protein_col] + f"{sep}0"
    # Count unique proteoform_id per protein_id
    n_proteoforms = data.groupby(protein_col)['proteoform_id'].transform('nunique')
    # Adjust n_proteoforms if "_0" exists
    has_zero_cluster = data['proteoform_id'].str.endswith(f'{sep}0')
    n_proteoforms -= has_zero_cluster.groupby(data[protein_col]).transform('sum')
    # Final adjustment of proteoform_id
    data.loc[n_proteoforms.isin([0, 1]), 'proteoform_id'] = data[protein_col]
    return data
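As a toy illustration (all values made up), consider two proteins where only the first has a significant second cluster:

# Toy example: P1 has a significant second cluster, P2 has no significant split
toy = pd.DataFrame({
    "protein_id": ["P1", "P1", "P1", "P2", "P2"],
    "cluster": [1, 1, 2, 1, 1],
    "proteoform_score": [0.9, 0.9, 0.9, 0.1, 0.1],
    "proteoform_score_pval_adj": [0.001, 0.001, 0.001, 0.8, 0.8],
})
toy = update_proteoform_grouping_in_COPF(toy, score_thr=None, pval_thr=0.05)
# P1 is split into two proteoform groups (P1_1, P1_2);
# P2 keeps a single group (labelled P2_0)
print(toy[["protein_id", "cluster", "proteoform_id"]])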
Performance Analysis¶
Data Aggregation and Preprocessing¶
The first step aggregates results from all methods and scenarios. The loop below:
- Reads each method's `.feather` result files for all three perturbation scenarios
- Calculates performance metrics (TP, FP, TN, FN) across p-value thresholds (a stand-in sketch of the metric calculation follows below)
- Combines everything into a single DataFrame (`results_data`)
This unified dataset serves as the foundation for all subsequent performance visualizations.
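`utils.calculate_metrics` ships with this project's `src` package; a minimal stand-in consistent with the metric columns compiled below (TP, FP, TN, FN, TPR, FPR, FDR, MCC, Precision, Recall, F1) could look like this sketch:

import numpy as np

def calculate_metrics_sketch(true_labels, pred_labels):
    # Confusion-matrix counts for boolean labels
    t = np.asarray(true_labels, dtype=bool)
    p = np.asarray(pred_labels, dtype=bool)
    tp, fp = int(np.sum(t & p)), int(np.sum(~t & p))
    tn, fn = int(np.sum(~t & ~p)), int(np.sum(t & ~p))
    eps = 1e-12  # guard against division by zero
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    mcc_den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "TPR": recall, "FPR": fp / (fp + tn + eps), "FDR": fp / (tp + fp + eps),
        "MCC": (tp * tn - fp * fn) / mcc_den,
        "Precision": precision, "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall + eps),
    }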
st = utils.getTime()
results_data = pd.DataFrame()
## Comment out the block below if a re-run is unnecessary
# Variables
pval_col = "adj_pval"
peptide_col = "peptide_id"
protein_col = "protein_id"
# pepID_col = "peptideID" # Numerical ID for peptides
perturbed_protein_col = "perturbed_protein"
perturbed_peptide_col = "perturbed_peptide"
proteoform_group_col = "PTM_id"
cluster_col = "ClusterID"
## Proteoform Grouping Benchmark of COPF and ProteoForge
# Read and combine the outputs:
for dataID in ["2pep", "050pep", "random"]:
    for curMethod in ["COPF", "ProteoForge"]:
        # Read the data
        data = pd.read_feather(f"{data_path}{curMethod}_{dataID}_result.feather")
        # Loop through the thresholds
        metric_data = []
        for thr in thresholds:
            # Method-specific post-processing: flag proteins whose peptides split into >1 group
            if curMethod == "COPF":
                result = update_proteoform_grouping_in_COPF(
                    data,
                    score_thr=None,
                    pval_thr=thr,
                    protein_col=protein_col,
                    cluster_col='cluster',
                    pval_col='proteoform_score_pval',
                    score_col='proteoform_score',
                )
                result['n_proteoforms'] = result.groupby(protein_col)['proteoform_id'].transform('nunique')
                result['WithProteoform'] = result['n_proteoforms'] > 1
                result = result[[protein_col, perturbed_protein_col, 'WithProteoform']].drop_duplicates().reset_index(drop=True)
            elif curMethod == "ProteoForge":
                result = data[[
                    protein_col, peptide_col,
                    perturbed_protein_col, perturbed_peptide_col,
                    pval_col, cluster_col
                ]].drop_duplicates().reset_index(drop=True).copy()
                result[proteoform_group_col] = -1
                result.loc[
                    result[pval_col] < thr, proteoform_group_col
                ] = result.loc[result[pval_col] < thr, cluster_col]
                result['isPTM'] = result[proteoform_group_col] > 0
                result['isGroup'] = result.groupby([protein_col, proteoform_group_col])[peptide_col].transform('count') > 1
                result['dPF'] = result['isGroup'] & result['isPTM']
                # If at least one dPF is True for a protein, that protein has a proteoform
                result['WithProteoform'] = result.groupby(protein_col)['dPF'].transform('any')
                # Drop the duplicates
                result = result[[protein_col, perturbed_protein_col, 'WithProteoform']].drop_duplicates().reset_index(drop=True)
            true_labels = result[perturbed_protein_col]
            pred_labels = result['WithProteoform']
            metrics = utils.calculate_metrics(
                true_labels=true_labels, pred_labels=pred_labels,
                verbose=False, return_metrics=True
            )
            metrics['threshold'] = thr
            metric_data.append(pd.DataFrame([metrics]))
        metric_data = pd.concat(metric_data, axis=0, ignore_index=True)
        metric_data["method"] = curMethod
        metric_data["perturbation"] = dataID
        results_data = pd.concat([results_data, metric_data], axis=0, ignore_index=True)
print(f"Data shape: {results_data.shape}")
results_data["perturbation"] = results_data["perturbation"].replace({
# "1pep": "1 Peptide",
"2pep": "2 Peptides",
"050pep": "%50 Peptides",
"random": "Random (2 to %50) Peptides",
})
# Save the processed data
output_file = f"{output_path}peptide_grouping_performance_data.feather"
results_data.to_feather(output_file)
results_data.to_csv(output_file.replace('.feather', '.csv'), index=False)
results_data.tail()
Data shape: (150, 14)
| | TP | FP | TN | FN | TPR | FPR | FDR | MCC | Precision | Recall | F1 | threshold | method | perturbation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 145 | 948 | 861 | 356 | 52 | 0.9480 | 0.7075 | 0.4760 | 0.3089 | 0.5240 | 0.9480 | 0.6750 | 0.6000 | ProteoForge | Random (2 to %50) Peptides |
| 146 | 953 | 891 | 326 | 47 | 0.9530 | 0.7321 | 0.4832 | 0.2938 | 0.5168 | 0.9530 | 0.6702 | 0.7000 | ProteoForge | Random (2 to %50) Peptides |
| 147 | 956 | 915 | 302 | 44 | 0.9560 | 0.7518 | 0.4890 | 0.2799 | 0.5110 | 0.9560 | 0.6660 | 0.8000 | ProteoForge | Random (2 to %50) Peptides |
| 148 | 961 | 939 | 278 | 39 | 0.9610 | 0.7716 | 0.4942 | 0.2693 | 0.5058 | 0.9610 | 0.6628 | 0.9000 | ProteoForge | Random (2 to %50) Peptides |
| 149 | 965 | 956 | 261 | 35 | 0.9650 | 0.7855 | 0.4977 | 0.2625 | 0.5023 | 0.9650 | 0.6607 | 1.0000 | ProteoForge | Random (2 to %50) Peptides |
The table above shows the compiled data structure. Each row captures performance for a specific method and perturbation scenario at a given p-value threshold.
ROC Analysis¶
The Receiver Operating Characteristic (ROC) curve evaluates how well each method distinguishes perturbed from unperturbed proteins by plotting the True Positive Rate (TPR, sensitivity) against the False Positive Rate (FPR, 1 − specificity).
Interpretation:
- A perfect classifier hugs the top-left corner
- Area Under the Curve (AUC): `1.0` = perfect, `0.5` = random chance (computed here with the trapezoidal rule, as sketched below)
- Black circles indicate performance at `p = 1e-3`
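The AUC values shown in each panel are computed with the trapezoidal rule over the (FPR, TPR) points; a toy example with made-up values:

import numpy as np

# Toy ROC points sorted by FPR (illustrative values only)
fpr = np.array([0.0, 0.1, 0.3, 1.0])
tpr = np.array([0.0, 0.6, 0.8, 1.0])
auc = np.trapezoid(tpr, fpr)  # NumPy >= 2.0; use np.trapz on older versions
print(f"AUC = {auc:.3f}")  # AUC = 0.800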
# ROC Curves for different perturbations and methods
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True, gridspec_kw={"wspace": 0.1})
for i, pert in enumerate(data_order):
    cur_data = results_data[results_data["perturbation"] == pert].copy()
    # Ensure the data is complete for ROC curves
    cur_data = cur_data.groupby("method").apply(
        lambda x: utils.complete_curve_data(x, 'ROC', 'FPR', 'TPR')
    ).reset_index(drop=True)
    # Calculate AUC per method from TPR and FPR
    auc_data = cur_data.groupby("method").apply(
        lambda x: np.trapezoid(
            x.sort_values("FPR")["TPR"], x.sort_values("FPR")["FPR"]
        ),
        include_groups=False
    )
    # Plot the ROC curve
    sns.lineplot(
        data=cur_data,
        x="FPR",
        y="TPR",
        hue="method",
        style="method",
        palette=method_palette,
        ax=axes[i],
        # Show the points
        markers=method_markers,
        dashes=False,
        markersize=7.5,
        markeredgewidth=0.5,
        markeredgecolor="gray",
        legend=False,
        rasterized=True,
        estimator=None,
    )
    # Add AUC values as legend-like text
    for j, method in enumerate(auc_data.index):
        auc = auc_data[method]
        # Use the method's palette color
        color = method_palette[method]
        axes[i].text(
            0.95,
            0.05 + j * 0.075,
            f"{method}: {auc:.2f}",
            color=color,
            transform=axes[i].transAxes,
            ha="right",
            va="bottom",
            fontsize=12,
            fontweight="bold",
        )
        # Add a large open marker at the pthr value for each method
        pthr_data = cur_data[(cur_data["method"] == method) & (cur_data["threshold"] == pthr)]
        axes[i].scatter(
            pthr_data["FPR"],
            pthr_data["TPR"],
            color=color,
            s=125,
            edgecolor="black",
            linewidth=1.25,
            marker="o",
            facecolors="none",
        )
    axes[i].plot([0, 1], [0, 1], color="black", linestyle="--")
    axes[i].set_title(f"{pert}", fontsize=14, fontweight="bold")
    axes[i].set_xlabel("FPR", fontsize=12)
    axes[i].set_ylabel("TPR", fontsize=12)
    axes[i].set_xlim(-0.05, 1.05)
    axes[i].set_ylim(-0.05, 1.05)
    axes[i].grid(True, which="both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
# Add a text indicating the circle is the p-value threshold under fig.title and above the first subplot
fig.text(
0.9,
.97,
f"Black circles represent the TPR and FPR values at p-value threshold of {pthr}",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"ROC_curves_perturbations_methods",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
ROC Analysis Results:
The ROC curves reveal consistent performance patterns across scenarios:
- Overall Performance: ProteoForge consistently outperforms COPF across all perturbation scenarios, achieving higher AUC values (~0.86 vs ~0.77)
- Behavior at `p = 0.001`: ProteoForge reaches ~0.8 TPR but with higher FPR, while COPF shows lower TPR (~0.2) with lower FPR
- Interpretation: ProteoForge is more sensitive (finds more true positives) but may produce more false positives at this threshold; COPF is more conservative but misses many perturbed proteins
Precision-Recall (PR) Analysis¶
ROC curves can be overly optimistic under class imbalance (far more unperturbed than perturbed proteins). Precision-Recall (PR) curves provide a more focused evaluation; a quick numeric illustration follows the list below.
Interpretation:
- Goal: stay in the top-right corner (high precision + high recall)
- More informative for understanding the trustworthiness of positive identifications
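A back-of-the-envelope example (made-up numbers) shows why: with heavy imbalance, even a low FPR translates into many false positives and mediocre precision.

# 100 perturbed vs 1000 unperturbed proteins; TPR = 0.8, FPR = 0.05
tp = 0.80 * 100   # 80 true positives
fp = 0.05 * 1000  # 50 false positives
print(f"Precision = {tp / (tp + fp):.2f}")  # Precision = 0.62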
## Precision-Recall Curves for different perturbations and methods
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True, gridspec_kw={"wspace": 0.1})
for i, pert in enumerate(data_order):
    cur_data = results_data[results_data["perturbation"] == pert].copy()
    # Add a small value to avoid division by zero
    cur_data["Precision"] = cur_data["TP"] / (cur_data["TP"] + cur_data["FP"] + 1e-6)
    cur_data["Recall"] = cur_data["TP"] / (cur_data["TP"] + cur_data["FN"] + 1e-6)
    cur_data = cur_data.sort_values("Recall", ascending=False)
    cur_data = cur_data[~((cur_data["Recall"] == 0) & (cur_data["Precision"] == 0))]
    # Ensure the data is complete for PR curves
    cur_data = cur_data.groupby("method").apply(
        lambda x: utils.complete_curve_data(x, 'PR', 'Recall', 'Precision')
    ).reset_index(drop=True)
    # Calculate PR-AUC per method from Precision and Recall
    # using the trapezoidal rule
    pr_auc_data = cur_data.groupby("method").apply(
        lambda x: np.trapezoid(
            x.sort_values("Recall")["Precision"], x.sort_values("Recall")["Recall"]
        ),
        include_groups=False
    )
    # Plot the PR curve
    sns.lineplot(
        data=cur_data,
        x="Recall",
        y="Precision",
        hue="method",
        style="method",
        palette=method_palette,
        ax=axes[i],
        # Show the points
        markers=method_markers,
        dashes=False,
        markersize=7.5,
        markeredgewidth=0.5,
        markeredgecolor="gray",
        legend=False,
        rasterized=True,
        estimator=None,
    )
    # Add PR-AUC values as legend-like text
    for j, method in enumerate(pr_auc_data.index):
        auc = pr_auc_data[method]
        # Use the method's palette color
        color = method_palette[method]
        axes[i].text(
            0.05,
            0.05 + j * 0.075,
            f"{method}: {auc:.2f}",
            color=color,
            transform=axes[i].transAxes,
            ha="left",
            va="bottom",
            fontsize=12,
            fontweight="bold",
        )
        # Add a large open marker at the pthr value for each method
        pthr_data = cur_data[(cur_data["method"] == method) & (cur_data["threshold"] == pthr)]
        axes[i].scatter(
            pthr_data["Recall"],
            pthr_data["Precision"],
            color=color,
            s=125,
            edgecolor="black",
            linewidth=1.25,
            marker="o",
            facecolors="none",
        )
    axes[i].set_title(f"{pert}", fontsize=14, fontweight="bold")
    axes[i].set_xlabel("Recall", fontsize=12)
    axes[i].set_ylabel("Precision", fontsize=12)
    axes[i].set_xlim(-0.05, 1.05)
    axes[i].set_ylim(-0.05, 1.05)
    axes[i].grid(True, which="both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
# Add a text indicating the circle is the p-value threshold under fig.title and above the first subplot
fig.text(
0.9,
.97,
f"Black circles represent the Precision and Precision values at p-value threshold of {pthr}",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"PR_curves_perturbations_methods",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
PR Analysis Results:
The PR curves reveal performance differences:
- ProteoForge: Achieves 0.87 PR-AUC across all scenarios
- COPF: Ranges from 0.73 (2pep) to 0.76 (%50)—worst in 2pep, best in %50
- At `p = 1e-3`: ProteoForge has ~0.8 precision and ~0.7 recall, while COPF has ~1.0 precision and ~0.2 recall
- Key observation: COPF shows a sharp precision decline around 0.2 recall in 2pep (many false positives when trying to find more true positives), while ProteoForge maintains higher precision even as recall increases
F1 Score Across P-Value Thresholds¶
Optimal Thresholds at Maximum F1¶
For practical use, a single optimal p-value threshold is often needed. The table below shows the threshold that maximizes the F1 score (harmonic mean of precision and recall) for each method and scenario.
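For reference, the F1 score combines the two as

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$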
# Identify the best threshold for each method and perturbation
best_thresholds = results_data.groupby(["method", "perturbation"]).apply(
    lambda x: x.loc[x["F1"].idxmax(), ["threshold", "F1"]]
).reset_index()
best_thresholds.columns = ["method", "perturbation", "threshold", "F1"]
best_thresholds
| | method | perturbation | threshold | F1 |
|---|---|---|---|---|
| 0 | COPF | %50 Peptides | 0.2000 | 0.6909 |
| 1 | COPF | 2 Peptides | 0.4000 | 0.6542 |
| 2 | COPF | Random (2 to %50) Peptides | 0.3000 | 0.6765 |
| 3 | ProteoForge | %50 Peptides | 0.0001 | 0.7740 |
| 4 | ProteoForge | 2 Peptides | 0.0001 | 0.7786 |
| 5 | ProteoForge | Random (2 to %50) Peptides | 0.0001 | 0.7857 |
The table provides concrete p-value cutoffs that maximize F1 in each situation. For example, with ProteoForge on the "Random" dataset, a threshold of 1e-4 (0.0001) achieves the best precision-recall balance.
Visualizing F1 Across Thresholds¶
The plots below show F1 score sensitivity to threshold choice. The x-axis displays -log10(p-value), so smaller (more significant) p-values appear on the right. Black circles mark optimal F1 values from the table above.
# F1 Score for different perturbations and methods at the best threshold
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True, gridspec_kw={"wspace": 0.1})
for i, pert in enumerate(data_order):
    cur_data = results_data[results_data["perturbation"] == pert].copy()
    cur_data["-log10(threshold)"] = -np.log10(cur_data["threshold"])
    sns.lineplot(
        data=cur_data,
        x="-log10(threshold)",
        y="F1",
        hue="method",
        style="method",
        palette=method_palette,
        ax=axes[i],
        markers=method_markers,
        dashes=False,
        markersize=7.5,
        markeredgewidth=0.5,
        markeredgecolor="gray",
        legend=False,
        rasterized=True,
        estimator=None,
    )
    # Add the highest F1 values as legend-like text
    for j, method in enumerate(best_thresholds[best_thresholds["perturbation"] == pert]["method"]):
        f1 = best_thresholds[(best_thresholds["perturbation"] == pert) & (best_thresholds["method"] == method)]["F1"].values[0]
        # Use the method's palette color (assigned before the scatter call that needs it)
        color = method_palette[method]
        # Add a large open marker at the p-value that gives the highest F1 score for each method
        pthr_data = cur_data[
            (cur_data["method"] == method) &
            (cur_data["threshold"] == best_thresholds[
                (best_thresholds["perturbation"] == pert) &
                (best_thresholds["method"] == method)
            ]["threshold"].values[0])
        ]
        axes[i].scatter(
            -np.log10(pthr_data["threshold"]),
            pthr_data["F1"],
            color=color,
            s=125,
            edgecolor="black",
            linewidth=1.25,
            marker="o",
            facecolors="none",
        )
        axes[i].text(
            0.05,
            0.8 + j * 0.1,
            f"{method}: {f1:.2f}",
            color=color,
            transform=axes[i].transAxes,
            ha="left",
            va="bottom",
            fontsize=12,
            fontweight="bold",
        )
    axes[i].set_title(f"{pert}", fontsize=14, fontweight="bold")
    axes[i].set_xlabel("-log10 P-value Threshold", fontsize=12)
    axes[i].set_ylabel("F1 Score", fontsize=12)
    axes[i].set_xlim(-0.5, 15.5)
    axes[i].set_ylim(-0.05, 1.05)
    axes[i].grid(True, which="both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
fig.text(
0.9,
.97,
f"Black circles represent the F1 values at the best threshold for each method",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"F1_scores_perturbations_methods",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
F1 Analysis Results:
These plots highlight method robustness to threshold choice:
- ProteoForge consistently achieves higher F1-scores across all perturbation scenarios (peak ~0.77–0.79) compared to COPF (peak ~0.65–0.69)
- Optimal thresholds are generally more stringent for ProteoForge (~1e-4), indicating good precision-recall balance even at stricter cutoffs
- Unlike peptide identification, F1 curves here are relatively smooth—small threshold changes do not drastically affect performance
- Practical implication: ProteoForge demonstrates robustness to threshold selection, making it a safer choice when the optimal cutoff is uncertain
MCC Score Across P-Value Thresholds¶
Optimal Thresholds at Maximum MCC¶
The Matthews Correlation Coefficient (MCC) provides a robust single-number summary accounting for all four confusion-matrix values:
- `+1` = Perfect prediction
- `0` = No better than random
- `-1` = Total disagreement
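It is computed directly from the confusion matrix:

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}$$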
The table below lists thresholds that maximize MCC—use these as alternative operating points when balanced performance is the priority.
best_thresholds = results_data.groupby(["method", "perturbation"]).apply(
    lambda x: x.loc[x["MCC"].idxmax(), ["threshold", "MCC"]]
).reset_index()
best_thresholds.columns = ["method", "perturbation", "threshold", "MCC"]
best_thresholds
| | method | perturbation | threshold | MCC |
|---|---|---|---|---|
| 0 | COPF | %50 Peptides | 0.1000 | 0.5005 |
| 1 | COPF | 2 Peptides | 0.1000 | 0.4548 |
| 2 | COPF | Random (2 to %50) Peptides | 0.1000 | 0.4891 |
| 3 | ProteoForge | %50 Peptides | 0.0000 | 0.6324 |
| 4 | ProteoForge | 2 Peptides | 0.0000 | 0.6421 |
| 5 | ProteoForge | Random (2 to %50) Peptides | 0.0000 | 0.6412 |
This table provides MCC-optimized operating points, which often differ from F1-optimized thresholds depending on research goals.
Visualizing MCC Across Thresholds¶
The plots below show MCC score across the full range of p-value thresholds. Black circles mark maximum MCC values for each method.
# MCC Score for different perturbations and methods at the best threshold
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True, gridspec_kw={"wspace": 0.1, "hspace": 0.1})
for i, pert in enumerate(data_order):
    cur_data = results_data[results_data["perturbation"] == pert].copy()
    cur_data["-log10(threshold)"] = -np.log10(cur_data["threshold"])
    sns.lineplot(
        data=cur_data,
        x="-log10(threshold)",
        y="MCC",
        hue="method",
        style="method",
        palette=method_palette,
        ax=axes[i],
        markers=method_markers,
        dashes=False,
        markersize=7.5,
        markeredgewidth=0.5,
        markeredgecolor="gray",
        legend=False,
        rasterized=True,
        estimator=None,
    )
    # Add the highest MCC values as legend-like text
    for j, method in enumerate(best_thresholds[best_thresholds["perturbation"] == pert]["method"]):
        mcc = best_thresholds[(best_thresholds["perturbation"] == pert) & (best_thresholds["method"] == method)]["MCC"].values[0]
        # Use the method's palette color (assigned before the scatter call that needs it)
        color = method_palette[method]
        # Add a large open marker at the p-value that gives the highest MCC score for each method
        pthr_data = cur_data[
            (cur_data["method"] == method) &
            (cur_data["threshold"] == best_thresholds[
                (best_thresholds["perturbation"] == pert) &
                (best_thresholds["method"] == method)
            ]["threshold"].values[0])
        ]
        axes[i].scatter(
            -np.log10(pthr_data["threshold"]),
            pthr_data["MCC"],
            color=color,
            s=125,
            edgecolor="black",
            linewidth=1.25,
            marker="o",
            facecolors="none",
        )
        axes[i].text(
            0.05,
            0.75 + j * 0.075,
            f"{method}: {mcc:.2f}",
            color=color,
            transform=axes[i].transAxes,
            ha="left",
            va="bottom",
            fontsize=12,
            fontweight="bold",
        )
    axes[i].set_title(f"{pert}", fontsize=14, fontweight="bold")
    axes[i].set_xlabel("-log10 P-value Threshold", fontsize=12)
    axes[i].set_ylabel("MCC Score", fontsize=12)
    axes[i].set_xlim(-0.5, 15.5)
    axes[i].set_ylim(-0.25, 1.05)
    axes[i].grid(True, which="both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
fig.text(
0.9,
.97,
f"Black circles represent the MCC values at the best threshold for each method",
ha="right",
va="center",
fontsize=12,
fontstyle="italic",
color="gray",
)
# Finalize the plot
plt.tight_layout()
plots.finalize_plot(
fig,
show=True,
save=save_to_folder,
filename=f"MCC_scores_perturbations_methods",
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi,
)
MCC Analysis Results:
The MCC analysis confirms performance trends observed with other metrics:
- ProteoForge consistently achieves higher MCC scores (peak ~0.63–0.64) compared to COPF (peak ~0.45–0.50)
- This indicates a substantial performance gap between the two methods
Conclusions¶
Across all four evaluation metrics (ROC, Precision-Recall, F1, MCC), a consistent picture emerges:
| Method | Characteristics | Recommended Use |
|---|---|---|
| ProteoForge | Most robust, high-performing, versatile | General choice—stable across scenarios and thresholds |
| COPF | Specialized clustering | Large fraction of discordant peptides expected |
Execution Time¶
print("Notebook Execution Time:", utils.prettyTimer(utils.getTime() - startTime))
Notebook Execution Time: 00h:00m:27s